computing node
Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers
Doostmohammadian, Mohammadreza, Gabidullina, Zulfiya R., Rabiee, Hamid R.
In the rapidly evolving research on artificial intelligence (AI), the demand for fast, computationally efficient, and scalable solutions has increased in recent years. This paper considers the problem of optimizing the computing resources for distributed machine learning (ML) and optimization. Given a set of data distributed over a network of computing nodes/servers, the idea is to optimally assign the CPU (central processing unit) usage while simultaneously training each computing node locally via its own share of data. This formulates the problem as a co-optimization setup to (i) optimize the data processing and (ii) optimally allocate the computing resources. The information-sharing network among the nodes may be time-varying, but with balanced weights to ensure consensus-type convergence of the algorithm. The algorithm is all-time feasible, meaning that the computing resource-demand balance constraint holds at every iteration of the proposed solution. Moreover, the solution accommodates log-scale quantization over the information-sharing channels, i.e., nodes may exchange log-quantized data. As example applications, distributed support vector machines (SVM) and regression are considered as the ML training models. Results from perturbation theory, along with Lyapunov stability and eigen-spectrum analysis, are used to prove convergence towards the optimum. Compared to existing CPU scheduling solutions, the proposed algorithm improves the cost optimality gap by more than $50\%$.
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Instructional Material (0.68)
- Research Report (0.64)
- Energy > Power Industry (0.93)
- Information Technology (0.88)
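The all-time-feasible, consensus-type update described in the abstract above can be made concrete with a short sketch. The toy version below assumes quadratic local costs and a symmetric ring topology (both our assumptions, not the paper's setup); the property it demonstrates is that weighted gradient-difference updates preserve the resource-demand balance at every single iteration while equalizing marginal costs:

```python
import numpy as np

# Toy sketch (not the paper's exact algorithm): a consensus-type update for
# CPU allocations x_i that keeps sum(x) == total budget at every iteration,
# i.e., the resource-demand balance constraint is "all-time feasible".
# Local costs f_i(x_i) = 0.5 * a_i * (x_i - b_i)^2 are illustrative stand-ins.

rng = np.random.default_rng(0)
n = 6
a = rng.uniform(1.0, 3.0, n)          # local cost curvatures (hypothetical)
b = rng.uniform(2.0, 8.0, n)          # locally preferred CPU shares (hypothetical)
total = 30.0                          # total CPU budget across the network

x = np.full(n, total / n)             # feasible initialization: sums to the budget
W = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)  # ring graph
eta = 0.05

for _ in range(2000):
    grad = a * (x - b)                # f_i'(x_i)
    # Laplacian-gradient flow: each node moves by weighted gradient
    # differences with its neighbors, so sum(x) is conserved exactly.
    x = x + eta * (W @ grad - W.sum(axis=1) * grad)

print("budget preserved:", np.isclose(x.sum(), total))
print("marginal costs equalized (optimality):",
      np.allclose(a * (x - b), (a * (x - b)).mean(), atol=1e-3))
```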
ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
Sun, Yongqian, Pan, Xijie, Xiong, Xiao, Tao, Lei, Wang, Jiaju, Zhang, Shenglin, Yuan, Yuan, Li, Yuqi, Jian, Kunlin
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and insufficient accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: it constructs a failure graph from the output of a state classifier and then performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show that ClusterRCA achieves high accuracy in diagnosing network failures in HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
- North America > United States (0.14)
- Europe > United Kingdom (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Energy (0.47)
- Telecommunications (0.47)
- Information Technology (0.46)
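To make the graph-plus-random-walk step of ClusterRCA concrete, here is a rough sketch with a mocked state classifier. The edge weighting and restart probability are our assumptions rather than ClusterRCA's exact design; the point is only how classifier scores plus topology yield a culprit ranking:

```python
import numpy as np

# Illustrative sketch of the graph-based step only (the state classifier is
# mocked): given per-node anomaly scores, build a failure graph over
# NIC-connected nodes and rank culprits by random-walk visit counts.

nodes = ["node-A", "node-B", "node-C", "node-D"]
anomaly = np.array([0.9, 0.2, 0.7, 0.1])          # mocked classifier output
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)        # NIC-pair topology

# Weight each edge by the anomaly score of its target, then row-normalize,
# so the walk preferentially moves toward suspicious nodes.
P = adj * anomaly
P = P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
visits = np.zeros(len(nodes))
state = rng.integers(len(nodes))
for _ in range(10000):
    if rng.random() < 0.15:                        # PageRank-style restart
        state = rng.integers(len(nodes))
    else:
        state = rng.choice(len(nodes), p=P[state])
    visits[state] += 1

ranking = sorted(zip(nodes, visits), key=lambda kv: -kv[1])
print("suspected culprits:", ranking)
```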
Research on Edge Computing and Cloud Collaborative Resource Scheduling Optimization Based on Deep Reinforcement Learning
This study addresses the challenge of resource scheduling optimization in edge-cloud collaborative computing using deep reinforcement learning (DRL). The proposed DRL-based approach improves task processing efficiency, reduces overall processing time, enhances resource utilization, and effectively controls task migrations. Experimental results demonstrate the superiority of DRL over traditional scheduling algorithms, particularly in managing complex task allocation, dynamic workloads, and multiple resource constraints. Despite its advantages, further improvements are needed to enhance learning efficiency, reduce training time, and address convergence issues. Future research should focus on increasing the algorithm's fault tolerance to handle more complex and uncertain scheduling scenarios, thereby advancing the intelligence and efficiency of edge-cloud computing systems.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
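The abstract above gives no algorithmic detail, so the following is a deliberately minimal, hypothetical tabular Q-learning sketch of the edge-versus-cloud placement decision. The state, action, and cost model are invented for illustration and bear no relation to the paper's actual DRL design:

```python
import random

# Minimal tabular Q-learning sketch for an edge-vs-cloud placement decision.
# The cost model is hypothetical: edge is fast but capacity-limited,
# cloud adds a fixed transfer delay but scales to large tasks.

ACTIONS = ["edge", "cloud"]

def cost(task_size, action):
    return task_size * 2.0 if action == "edge" and task_size > 5 else \
           task_size * 0.5 + (4.0 if action == "cloud" else 0.0)

Q = {(s, a): 0.0 for s in range(11) for a in ACTIONS}
alpha, eps = 0.1, 0.2

for _ in range(20000):
    s = random.randint(0, 10)                      # task-size bucket
    a = random.choice(ACTIONS) if random.random() < eps \
        else min(ACTIONS, key=lambda act: Q[(s, act)])
    Q[(s, a)] += alpha * (cost(s, a) - Q[(s, a)])  # one-step (bandit-style) update

policy = {s: min(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(11)}
print(policy)   # learned rule: small tasks -> edge, large tasks -> cloud
```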
Beyond Model Scale Limits: End-Edge-Cloud Federated Learning with Self-Rectified Knowledge Agglomeration
Wu, Zhiyuan, Sun, Sheng, Wang, Yuwei, Liu, Min, Xu, Ke, Pan, Quyang, Gao, Bo, Wen, Tian
The rise of End-Edge-Cloud Collaboration (EECC) offers a promising paradigm for Artificial Intelligence (AI) model training across end devices, edge servers, and cloud data centers, providing enhanced reliability and reduced latency. Hierarchical Federated Learning (HFL) can benefit from this paradigm by enabling multi-tier model aggregation across distributed computing nodes. However, the potential of HFL is significantly constrained by the inherent heterogeneity and dynamic characteristics of EECC environments. Specifically, the uniform model structure, bounded by the least powerful end device, imposes a performance bottleneck across all computing nodes. Meanwhile, coupled heterogeneity in data distributions and resource capabilities across tiers disrupts hierarchical knowledge transfer, leading to biased updates and degraded performance. Furthermore, the mobility and fluctuating connectivity of computing nodes in EECC environments introduce complexities in dynamic node migration, further compromising the robustness of the training process. To address these challenges within a unified framework, we propose End-Edge-Cloud Federated Learning with Self-Rectified Knowledge Agglomeration (FedEEC), a novel EECC-empowered FL framework that allows the models trained from end to edge to cloud to grow larger in size and stronger in generalization ability. FedEEC introduces two key innovations: (1) Bridge Sample Based Online Distillation Protocol (BSBODP), which enables knowledge transfer between neighboring nodes through generated bridge samples, and (2) Self-Knowledge Rectification (SKR), which refines the transferred knowledge to prevent suboptimal cloud model optimization. The proposed framework handles both cross-tier resource heterogeneity and knowledge transfer between neighboring nodes, while satisfying the migration-resilience requirements of EECC.
- Asia > China > Beijing > Beijing (0.05)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Virginia (0.04)
- Information Technology > Security & Privacy (1.00)
- Education (0.93)
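A minimal sketch of the online-distillation idea behind BSBODP: a model on one tier mimics a neighboring tier's model on shared bridge samples. The KL-divergence loss, temperature, and the random tensors standing in for generated bridge samples are all our assumptions; the abstract does not specify the generator or loss weighting:

```python
import torch
import torch.nn.functional as F

# Sketch of distillation between a smaller "end" model and a larger "edge"
# model on bridge samples, in the spirit of BSBODP (details assumed).

def distill_step(student, teacher, bridge_x, optimizer, T=2.0):
    """One online-distillation step: student mimics teacher on bridge samples."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(bridge_x)
    s_logits = student(bridge_x)
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny usage example with mismatched model sizes across tiers:
end_model  = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 4))
edge_model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
opt = torch.optim.SGD(end_model.parameters(), lr=0.1)
bridge = torch.randn(32, 16)      # stand-in for generated bridge samples
print(distill_step(end_model, edge_model, bridge, opt))
```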
Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Shalev-Shwartz, Shai, Zhang, Tong
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007].
- North America > United States (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
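For the squared loss, the SDCA coordinate update has a closed form, which makes the mini-batch mechanics easy to show. The sketch below is the plain mini-batch variant that the paper accelerates, with a conservative 1/m scaling of simultaneous updates; it is not the accelerated method itself:

```python
import numpy as np

# Vanilla mini-batch SDCA for ridge regression (squared loss). The dual
# variables alpha maintain the primal iterate via w = X.T @ alpha / (lam * n).

rng = np.random.default_rng(0)
n, d, lam, m = 200, 10, 0.1, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

alpha = np.zeros(n)
w = np.zeros(d)

for _ in range(3000):
    batch = rng.choice(n, size=m, replace=False)
    # Closed-form SDCA step for the squared loss, scaled by 1/m for safety
    # when the whole batch is applied simultaneously.
    residual = y[batch] - X[batch] @ w - alpha[batch]
    delta = residual / (1.0 + (X[batch] ** 2).sum(axis=1) / (lam * n)) / m
    alpha[batch] += delta
    w += X[batch].T @ delta / (lam * n)

primal = 0.5 * ((X @ w - y) ** 2).mean() + 0.5 * lam * (w @ w)
print("primal objective:", primal)
```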
Semantic Revolution from Communications to Orchestration for 6G: Challenges, Enablers, and Research Directions
Shokrnezhad, Masoud, Mazandarani, Hamidreza, Taleb, Tarik, Song, Jaeseung, Li, Richard
In the context of emerging 6G services, the realization of everything-to-everything interactions involving a myriad of physical and digital entities presents a crucial challenge. This challenge is exacerbated by resource scarcity in communication infrastructures, necessitating innovative solutions for effective service implementation. Semantic Communications (SemCom) shows great promise in addressing this challenge by enhancing point-to-point physical-layer efficiency. However, achieving efficient SemCom requires overcoming the significant hurdle of knowledge sharing between semantic decoders and encoders, particularly in dynamic, non-stationary environments with stringent end-to-end quality requirements. To bridge this gap in the existing literature, this paper introduces the Knowledge Base Management And Orchestration (KB-MANO) framework. Rooted in the concepts of Computing-Network Convergence (CNC) and lifelong learning, KB-MANO is crafted for the allocation of network and computing resources dedicated to updating and redistributing KBs across the system. The primary objective is to minimize the impact of knowledge management activities on actual service provisioning. A proof-of-concept is proposed to showcase the integration of KB-MANO with resource allocation in radio access networks. Finally, the paper offers insights into future research directions, emphasizing the transformative potential of semantic-oriented communication systems in the realm of 6G technology.
- Europe > Finland > Northern Ostrobothnia > Oulu (0.04)
- North America > United States (0.04)
- Europe > Germany (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Energy (1.00)
- Telecommunications (0.66)
- Information Technology > Security & Privacy (0.47)
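KB-MANO's stated objective, minimizing the impact of KB management on actual service provisioning, can be caricatured with a toy capacity split. The model and numbers below are entirely invented for illustration; the paper defines no such closed form:

```python
# Toy illustration of the KB-MANO trade-off: split a link's capacity between
# service traffic and KB (knowledge base) redistribution so that service QoS
# is met first and KB staleness is minimized with the remainder.

link_capacity_mbps = 100.0
service_demand_mbps = 72.0            # must be satisfied for QoS
kb_size_mb = 40.0                     # updated KB to redistribute

kb_rate = max(0.0, link_capacity_mbps - service_demand_mbps)
staleness_s = float("inf") if kb_rate == 0 else kb_size_mb * 8 / kb_rate
print(f"KB sync rate: {kb_rate} Mbps, KB refresh time: {staleness_s:.1f} s")
```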
Disttack: Graph Adversarial Attacks Toward Distributed GNN Training
Zhang, Yuxiang, Liu, Xin, Wu, Meng, Yan, Wei, Yan, Mingyu, Ye, Xiaochun, Fan, Dongrui
Graph Neural Networks (GNNs) have emerged as potent models for graph learning. Distributing the training process across multiple computing nodes is the most promising solution to address the challenges of ever-growing real-world graphs. However, current adversarial attack methods on GNNs neglect the characteristics and applications of the distributed scenario, leading to suboptimal performance and inefficiency in attacking distributed GNN training. In this study, we introduce Disttack, the first adversarial attack framework for distributed GNN training, which leverages the frequent gradient updates characteristic of distributed systems. Specifically, Disttack corrupts distributed GNN training by injecting adversarial perturbations into a single computing node. The attacked subgraphs are precisely perturbed to induce an abnormal gradient ascent in backpropagation, disrupting gradient synchronization between computing nodes and thus causing a significant performance decline of the trained GNN. We evaluate Disttack on four large real-world graphs by attacking five widely adopted GNNs. Compared with the state-of-the-art attack method, experimental results demonstrate that Disttack amplifies model accuracy degradation by a factor of 2.75 and achieves a 17.33-fold speedup on average while remaining unnoticeable. Keywords: Graph Neural Networks, Distributed Training, Adversarial Attack.
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
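A schematic of the single-node poisoning step in the spirit of Disttack, using an FGSM-style ascent on local features: the compromised worker perturbs its own subgraph so that its local loss, and hence the gradient it contributes to synchronization, becomes abnormal. The surrogate model, loss, and budget below are placeholders; a real setup would perturb a GNN over the local subgraph:

```python
import torch

# Poison one worker's local features along the direction that *increases*
# its local loss, bounded by eps to stay unnoticeable (details assumed).

def poison_local_features(model, x_sub, y_sub, loss_fn, eps=0.05):
    x_adv = x_sub.clone().requires_grad_(True)
    loss = loss_fn(model(x_adv), y_sub)
    loss.backward()
    # FGSM-style ascent step on the node features of the local subgraph.
    return (x_sub + eps * x_adv.grad.sign()).detach()

# Usage with a stand-in model (a real attack would target a GNN):
model = torch.nn.Linear(8, 3)
x_sub = torch.randn(20, 8)                        # local subgraph node features
y_sub = torch.randint(0, 3, (20,))
x_poisoned = poison_local_features(model, x_sub, y_sub,
                                   torch.nn.functional.cross_entropy)
print((x_poisoned - x_sub).abs().max())           # perturbation bounded by eps
```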
Towards a Dynamic Future with Adaptable Computing and Network Convergence (ACNC)
Shokrnezhad, Masoud, Yu, Hao, Taleb, Tarik, Li, Richard, Lee, Kyunghan, Song, Jaeseung, Westphal, Cedric
In the context of advancing 6G, a substantial paradigm shift is anticipated, highlighting comprehensive everything-to-everything interactions characterized by numerous connections and stringent adherence to Quality of Service/Experience (QoS/E) prerequisites. The imminent challenge stems from resource scarcity, prompting a deliberate transition to Computing-Network Convergence (CNC) as an auspicious approach for joint resource orchestration. While CNC-based mechanisms have garnered attention, their effectiveness in realizing future services, particularly in use cases like the Metaverse, may encounter limitations due to the continually changing nature of users, services, and resources. Hence, this paper presents the concept of Adaptable CNC (ACNC) as an autonomous Machine Learning (ML)-aided mechanism crafted for the joint orchestration of computing and network resources, catering to dynamic and voluminous user requests with stringent requirements. ACNC encompasses two primary functionalities: state recognition and context detection. Given the intricate nature of the user-service-computing-network space, the paper employs dimension reduction to generate live, holistic, abstract system states in a hierarchical structure. To address the challenges posed by dynamic changes, Continual Learning (CL) is employed, classifying the system state into contexts controlled by dedicated ML agents, enabling them to operate efficiently. These two functionalities are intricately linked within a closed loop overseen by the End-to-End (E2E) orchestrator to allocate resources. The paper introduces the components of ACNC, proposes a Metaverse scenario to exemplify ACNC's role in resource provisioning with Segment Routing v6 (SRv6), outlines ACNC's workflow, details a numerical analysis for efficiency assessment, and concludes with discussions on relevant challenges and potential avenues for future research.
- Europe > Finland > Northern Ostrobothnia > Oulu (0.05)
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > China (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report (0.64)
- Workflow (0.48)
- Information Technology (0.48)
- Energy (0.35)
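ACNC's two functionalities, state recognition and context detection, can be mimicked on synthetic telemetry as below. PCA, k-means, and all dimensions are our modeling assumptions, not the paper's; the sketch only shows the pipeline shape of abstracting states and routing them to context-specific agents:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# (1) State recognition: dimension reduction of raw user/service/network
#     measurements into abstract system states.
# (2) Context detection: cluster the abstract states into contexts, each
#     handled by a dedicated ML agent (here just a label).

rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 40))                  # high-dimensional system snapshots

abstract_states = PCA(n_components=3).fit_transform(raw)
contexts = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(abstract_states)

agents = {c: f"agent-{c}" for c in range(4)}      # one agent per context
print("snapshot 0 handled by", agents[contexts[0]])
```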
Deterministic Computing Power Networking: Architecture, Technologies and Prospects
Jia, Qingmin, Hu, Yujiao, Zhou, Xiaomao, Ma, Qianpiao, Guo, Kai, Zhang, Huayu, Xie, Renchao, Huang, Tao, Liu, Yunjie
With the development of new Internet services such as computation-intensive and delay-sensitive tasks, the traditional "best effort" network transmission mode has been greatly challenged. Network systems are urgently required to provide end-to-end transmission determinacy and computing determinacy for new applications to ensure the safe and efficient operation of services. Building on research into the convergence of computing and networking, a new network paradigm named deterministic computing power networking (Det-CPN) is proposed. In this article, we first review research advances in computing power networking. We then analyze the motivations and scenarios of Det-CPN. Following that, we present its system architecture, technological capabilities, workflow, and key technologies. Finally, the challenges and future trends of Det-CPN are analyzed and discussed.
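As a minimal illustration of transmission determinacy (a generic deterministic-networking admission check, not Det-CPN's specific mechanisms), a flow is admitted only if the sum of worst-case per-hop latencies fits its end-to-end deadline:

```python
# Hypothetical per-hop latency bounds; real deterministic networks derive
# these from shaping, scheduling, and queueing guarantees.

def admit(flow_deadline_ms, path_hops):
    worst_case = sum(h["queue_ms"] + h["link_ms"] for h in path_hops)
    return worst_case <= flow_deadline_ms, worst_case

path = [{"queue_ms": 0.5, "link_ms": 0.2} for _ in range(6)]
ok, bound = admit(5.0, path)
print(f"admitted={ok}, worst-case latency={bound:.1f} ms")
```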
A Light-weight and Unsupervised Method for Near Real-time Behavioral Analysis using Operational Data Measurement
Vargis, Tom Richard, Ghiasvand, Siavash
Monitoring the status of large computing systems is essential to identify unexpected behavior and improve their performance and uptime. However, due to the large-scale, distributed design of such computing systems and the large number of monitoring parameters, automated monitoring methods are required. Such methods should adapt to continuous changes in the computing system, and they should identify behavioral anomalies quickly enough to allow appropriate reactions. This work proposes a general, lightweight, and unsupervised method for near real-time anomaly detection using operational data measurement on large computing systems. The proposed model requires as little as 4 hours of data and 50 epochs per training process to accurately capture the behavioral pattern of a computing system.
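The abstract does not name the model, so the sketch below assumes a common unsupervised scheme: an autoencoder trained on normal operational metrics, flagging observations whose reconstruction error exceeds a threshold. The 50 training epochs match the abstract; everything else (architecture, window, threshold rule) is a placeholder:

```python
import torch

# Generic reconstruction-error anomaly detection on operational metrics,
# assuming an autoencoder-style model (an assumption, not the paper's stated design).

ae = torch.nn.Sequential(torch.nn.Linear(12, 4), torch.nn.ReLU(), torch.nn.Linear(4, 12))
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)

normal = torch.randn(512, 12)                     # stand-in for ~4h of node metrics
for _ in range(50):                               # "50 epochs", as in the abstract
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(ae(normal), normal)
    loss.backward()
    opt.step()

with torch.no_grad():
    # Threshold: worst reconstruction error seen on normal data.
    threshold = (ae(normal) - normal).pow(2).mean(dim=1).max().item()
    sample = torch.randn(1, 12) * 4               # strongly off-distribution observation
    score = (ae(sample) - sample).pow(2).mean().item()

print("anomaly!" if score > threshold else "normal", score, threshold)
```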